Ever since COVID-19 emerged, research institutes and governments have publicly released many datasets so that research groups and independent individuals can analyze the coronavirus’s spread. We are facing an unprecedented public health crisis, and we believe that data-driven decisions, and people working together for the greater good, are among the better ways to deal with this difficult time.
In this blog, we want to know how the world’s news media is covering the COVID-19 pandemic. Building on its massive television news narratives dataset, GDELT released a news dataset of the URLs, titles, publication dates and brief snippets of more than 1.1 million worldwide English-language online news articles mentioning the virus, to enable researchers and journalists to understand the global context of how the outbreak has been covered since November 2019. This dataset has been expanding daily and includes a number of related topics.
A single article on Covid-19 can cover various topics, such as health, the business implications of the disease or climate change, or it could just be a front for propagating fake information. Given the huge number of news articles floating around the web in the wake of Covid-19, it is very difficult to compile and compare them. To analyze what is being discussed during these difficult times, we would first have to collect all the news articles and then annotate them according to their implicit news sub-categories. This motivated us to create an approach that annotates news articles on the coronavirus without any manual intervention. With such a pipeline we aim not only to help researchers, media professionals and journalists access similar articles, but also to spare them the time spent reading and understanding unrelated ones. We thus aim to improve the quality of the groups of similar articles and of the topics representing them.
We intend to address the huge flow of information, known as “information overload”, which makes it harder for users to find similar information on Covid-19 on the internet. We address it with an application that lets users find news matching their query or interest effortlessly. We foresee some challenges, which include determining the subtopic, extracting only the content of each webpage and presenting the data to the user. Multi-label classification (MLC), in which an object can carry more than one label, is useful in many real-world applications, but labelling a dataset manually is costly and tedious. An unsupervised approach should therefore be considered: cluster similar documents first, then use topic modelling to assign multiple labels to the clusters. We use an unsupervised learning technique (clustering) to group a collection of articles so that articles in the same group are more similar to each other than to those in other groups. Clustering can also help characterize the kinds of structure discovered.
We are trying to analyze a large set of news articles to make it easier for ordinary readers to filter through the many articles related to the virus and reach their own conclusions. Furthermore, we want to understand the semantic relations between different topics and, finally, analyze keywords to uncover patterns in the news content.
Can we find articles with topics similar to those of a given article?
In order to answer this question, we need to answer the following research questions:
1. What is the most dominant topic in the article?
2. How do we determine which value of K is best suited and most interpretable for topic modeling on our dataset?
3. How does the topic model perform with different features, namely Bag of Words with term frequency–inverse document frequency (TF-IDF) weighting versus Bag of Words with term frequency (TF) alone?
Data source
For our dataset we required news articles about the ongoing coronavirus pandemic. In our search, we came across the GDELT Project, which offers a compilation of URLs and brief snippets of worldwide English-language news coverage mentioning Covid-19. It contains data for the period from November 1, 2019 through March 26, 2020. GDELT dataset: http://data.gdeltproject.org/blog/2020-coronavirus-narrative/live_onlinenews/MASTERFILELIST.TXT
Scraping Method
On digging deeper into the dataset, we realized that only snippets of the news articles were included. The snippets were chosen by performing a keyword search for the following terms: Cases, Covid19, Falsehoods, Masks, Panic, Prices, Quarantine, Shortages, SocialDistancing, Testing and Ventilators, and selecting the paragraph containing the first occurrence. In addition to containing one of these terms, either the sentence itself or the sentences immediately before and after it had to contain the term “Coronavirus” or “Covid-19”, thus ensuring that the news article is related to the coronavirus.
The GDELT dataset had news articles related to the coronavirus, but a snippet alone would not be sufficient to understand the underlying topic of an article. Hence, we decided to scrape the articles ourselves, using the URL corresponding to each article in the GDELT dataset.
The dataset contained several files, each holding the articles extracted on a particular day for a particular keyword. As processing all the articles in each file would be computationally too heavy and infeasible, we agreed on creating a dataset of around 20000 records. We realized that the topics discussed during the initial period of the pandemic must have evolved in the months that followed. In order to capture the wide array of topics over the 5 months, we first downloaded all the files. Then, for all the files belonging to a keyword, we extracted a subset of records. This was repeated for every keyword. At the end of the extraction process we had around 20000 news articles as our final dataset.
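The per-keyword extraction described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code we ran: the `records` data frame, its `keyword` and `url` columns, the `sample_per_keyword` helper and the `per_keyword` quota are all hypothetical names chosen for the example.

```r
# Hypothetical input: one row per GDELT record, tagged with its keyword.
set.seed(42)
records <- data.frame(
  keyword = rep(c("Cases", "Masks", "Testing"), times = c(50, 30, 5)),
  url     = paste0("https://example.com/article/", 1:85),
  stringsAsFactors = FALSE
)

# Draw up to `per_keyword` rows (without replacement) from each keyword group.
sample_per_keyword <- function(df, per_keyword) {
  groups <- split(df, df$keyword)
  picked <- lapply(groups, function(g) {
    n <- min(per_keyword, nrow(g))
    g[sample.int(nrow(g), n), , drop = FALSE]
  })
  do.call(rbind, picked)
}

sampled <- sample_per_keyword(records, per_keyword = 10)
table(sampled$keyword)
```

Capping the draw per keyword keeps rare keywords represented while preventing the most frequent ones from dominating the final dataset.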
Cleanup
As the content we extracted came from websites, it contained numerous HTML tags and special characters. In the preprocessing stage, we first converted the data to lower case. We then cleaned the data by removing URLs (www, http), punctuation, special characters and stopwords, and by stripping extra whitespace. Once this was complete, the preprocessed corpus was ready for analysis.
Storage
After preprocessing, the dataset was stored in a CSV file and uploaded to Google Drive. Dataset: https://drive.google.com/file/d/1qgQiIIi1yhXBj1jAOVz_2dhNT4C2i6bc/view?usp=sharing
…
Our dataset is in text format, and therefore we pre-processed it before performing any exploratory analysis. This was required in order to clean it and remove unnecessary words or characters that would affect our analysis. Pre-processing is one of the most important steps in natural language processing, because well pre-processed data reduces the computation time required for further analysis, and the quality of the tokens and results tends to be higher than with poorly pre-processed data.
Steps taken for Pre-processing
* Removed URLs from the content
* Replaced punctuation, numbers and any other non-alphabetic characters
* Converted Latin words to UTF-8
* Converted the text to lower case
* Removed stop words
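The steps above can be sketched in base R roughly as follows. This is an illustrative sketch, not our exact pipeline: the `clean_text` helper and the tiny `stop_words` list are assumptions made for the example (a real run would use a full stop-word list), and the encoding conversion step is omitted.

```r
# A small hand-picked stop-word list, for illustration only.
stop_words <- c("the", "a", "an", "is", "are", "of", "in", "to", "and", "for")

clean_text <- function(x) {
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)  # drop URLs
  x <- gsub("[^A-Za-z]+", " ", x)                            # keep letters only
  x <- tolower(x)                                            # lower-case
  words <- strsplit(trimws(x), "\\s+")                       # tokenize on whitespace
  # Remove stop words and glue the remaining tokens back together.
  vapply(words,
         function(w) paste(w[!w %in% stop_words], collapse = " "),
         character(1))
}

cleaned <- clean_text("The 3 new cases are in https://example.com/news, and MASKS help!")
cleaned
```

URLs are removed before punctuation so that the pieces of a URL do not survive as stray tokens.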
Wordclouds give a visual summary of the underlying words in a text, here the news articles dataset. We generated wordclouds for two variants of the Bag of Words model: one weighted by term frequency (TF) and one by TF-IDF. We wanted to analyse how the size and kind of words differ when the weighting scheme changes for the same corpus.
As we can see in the wordcloud below, news articles have been all about the coronavirus pandemic. The terms with higher frequencies appear larger. Since the method used for below wordcloud is TF the most
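To make the difference between the two weighting schemes concrete, here is a small base R sketch on a toy corpus. The three documents are made up, and the idf formula used, log(N / df), is one common variant; packages may use a different base or smoothing.

```r
# Toy corpus: each document is already tokenized.
docs <- list(
  c("virus", "cases", "virus", "masks"),
  c("virus", "economy", "prices"),
  c("virus", "testing", "testing")
)
vocab <- sort(unique(unlist(docs)))

# Term-frequency matrix: rows are documents, columns are terms.
tf <- t(sapply(docs, function(d) table(factor(d, levels = vocab))))

# Inverse document frequency: terms found in every document get weight 0.
doc_freq <- colSums(tf > 0)
idf <- log(length(docs) / doc_freq)

# TF-IDF: scale each term column by its idf.
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 2)
```

Under TF, "virus" is the largest term in every document; under TF-IDF its weight drops to zero because it appears everywhere, which is why the two wordclouds emphasize different words.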
wordcloud2(d_bow,shape = "star",size = 0.4)
…
Using Bag of Words Model with TF-IDF Weighting scheme.
wordcloud2(d_tfidf,shape = "star",size = 0.15)
…
We used the elbow method along with other metrics to determine the optimal number of topics for topic modelling.
Topic modelling was run for different numbers of topics. Our intention was to tune this hyperparameter and so choose the number of topics for the best-performing model between BOW with TF and BOW with TF-IDF. Perplexity measures how well a topic model predicts held-out documents; lower values indicate a better fit.
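For reference, the usual definition of perplexity on a held-out set of documents is given below. The notation is ours, introduced only for this formula: M is the number of held-out documents, w_d the words of document d, and N_d its length.

```latex
\mathrm{perplexity}(D_{\text{test}})
  = \exp\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
```

It is the exponential of the negative average per-word log-likelihood, so a model that assigns higher probability to unseen documents gets a lower perplexity.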
Perplexity Score
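# Parallel coordinate plot for Model 1
library(GGally)      # ggparcoord()
library(viridis)     # scale_color_viridis()
library(hrbrthemes)  # theme_ipsum()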
ggparcoord(bow_test_train_model1,
columns = 1:25, groupColumn = 26,
scale = 'uniminmax',
showPoints = TRUE,
title = "Parallel Coordinate Plot For Model 1 with BOW as FE",
alphaLines = 0.1
) + scale_color_viridis(discrete=TRUE) +
theme_ipsum()+
theme(plot.title = element_text(size=8))
Plot 1 for Gibbs Sampling as Model 1
#Plot
ggparcoord(bow_test_train_model2,
columns = 1:25, groupColumn = 26,
scale = 'uniminmax',
showPoints = TRUE,
title = "Parallel Coordinate Plot For Model 2 with BOW as FE",
alphaLines = 0.1
) + scale_color_viridis(discrete=TRUE) +
theme_ipsum()+
theme(plot.title = element_text(size=8))
Plot 2 for Dot Product as Model 2
The metric below was used to evaluate the model.
#Plot
Likelihood Score
The first step was to reduce the dimensionality using t-SNE, starting from a feature set of thousands of columns representing words for the BOW with TF weighting scheme.
library(Rtsne)
# Reduce the dimensions via t-SNE (all columns except the first are used)
tsne <- Rtsne(doc_topics_gamma[, -1], perplexity = 30, pca = FALSE, check_duplicates = FALSE)
X <- data.frame(tsne$Y)
# Find the best number of clusters for 25 topics
# Within-group sum of squares for a single cluster, then for k = 1..100
wss <- (nrow(X) - 1) * sum(apply(X, 2, var))
for (i in 1:100) wss[i] <- sum(kmeans(X, iter.max = 50L, centers = i)$withinss)
plot(1:100, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
Elbow Curve Plot
# Clustering via k-means (k = 8)
library(factoextra)  # fviz_cluster()
k3 <- kmeans(X, centers = 8, nstart = 5, iter.max = 100000L)
fviz_cluster(k3, data = X)
Convex Hull Plot for 8 clusters
Silhouette Coefficient Plot 1
Silhouette Coefficient Plot 2
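The code behind the silhouette plots is not shown in this post. As a hedged sketch of how an average silhouette width can be computed for several values of k, the example below uses `silhouette()` from the `cluster` package on synthetic data. `X_demo` is a made-up stand-in for the t-SNE coordinates, and the range 2 to 6 is chosen only for illustration.

```r
library(cluster)  # silhouette()

set.seed(1)
# Stand-in for the 2-D t-SNE coordinates: three well-separated blobs.
X_demo <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.3), ncol = 2)
)

# Average silhouette width for k = 2..6; values closer to 1 mean tighter,
# better-separated clusters.
avg_sil <- sapply(2:6, function(k) {
  km  <- kmeans(X_demo, centers = k, nstart = 10)
  sil <- silhouette(km$cluster, dist(X_demo))
  mean(sil[, "sil_width"])
})
names(avg_sil) <- 2:6
avg_sil
```

The k with the highest average width is a reasonable candidate, and it can be compared against the elbow curve rather than used on its own.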
Sankey Network Diagram
library(networkD3)  # sankeyNetwork()
library(magrittr)   # provides %>%
links <- data.frame(
source = top_terms$topic,
target = top_terms$term,
value = top_terms$beta
)
nodes <- data.frame(
name=c(as.character(links$source),
as.character(links$target)) %>% unique()
)
# With networkD3, connections must be given by zero-based id, not by the names used in the links data frame, so we need to reformat them.
links$IDsource <- match(links$source, nodes$name)-1
links$IDtarget <- match(links$target, nodes$name)-1
# Make the Network
p <- sankeyNetwork(Links = links, Nodes = nodes,
Source = "IDsource", Target = "IDtarget",
Value = "value", NodeID = "name",
colourScale = JS("d3.scaleOrdinal(d3.schemeCategory20);"),
sinksRight=FALSE,fontSize = 16,height = 1400,width = 1200,
nodePadding = 8, fontFamily = "arial",unit = "Letter(s)")
p
Chord Diagram
library(circlize)  # chordDiagram(), mm_h()
chordDiagram(new_v, big.gap = 10, directional = 1,
  direction.type = c("diffHeight", "arrows"),
  link.arr.type = "big.arrow", diffHeight = -mm_h(1),
  grid.col = c("violet", "blue4", "blue", "green", "yellow", "tomato", "red", "cyan4", "deeppink", "cyan3", "chocolate4", "darkslategrey", "darksalmon", "chartreuse", "darkorchid2", "deepskyblue1", "lightcoral", "palegreen4", "paleturquoise2", "palevioletred", "peru", "pink4", "purple2", "sienna1", "skyblue2", "seagreen2", "rosybrown", "plum3", "slateblue2", "orange3", "darkgoldenrod2", "salmon2", "pink2"))
Chord Diagram for Topic to Cluster association
bp <- ggplot(temp, aes(x= Topic_Number,
y=Topic_Probability, group = 1)) +
geom_line(color = "steelblue",size = 2) +
geom_point(size = 2) +
labs(title = "Topic Distribution in each Cluster",
y = "Average Probability", x = "Topic Number")
bp +facet_grid(Cluster_Number ~ .)
Probability Distribution of Topics in each cluster
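The per-cluster averages plotted above, and the dominant topic of each document, can be derived from a document-topic probability matrix along the following lines. This is a sketch on made-up numbers: the `gamma` matrix and the `cluster` labels are hypothetical stand-ins for the model output and the k-means assignment.

```r
# Hypothetical document-topic probabilities (each row sums to 1)
# and a cluster label for each document.
gamma <- matrix(c(0.7, 0.2, 0.1,
                  0.6, 0.3, 0.1,
                  0.1, 0.1, 0.8,
                  0.2, 0.2, 0.6),
                ncol = 3, byrow = TRUE,
                dimnames = list(NULL, paste0("topic_", 1:3)))
cluster <- c(1, 1, 2, 2)

# Dominant topic per document: the column with the highest probability.
dominant_topic <- max.col(gamma, ties.method = "first")

# Average topic probability within each cluster (one row per cluster).
avg_by_cluster <- rowsum(gamma, group = cluster) / as.vector(table(cluster))
avg_by_cluster
```

Each row of `avg_by_cluster` corresponds to one facet of the plot: a cluster whose average is concentrated on a single topic is easy to label, while a flat profile suggests a mixed cluster.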
Terms in Topics
#####R bokeh plot :
figure(title = "Rbokeh plot representing Documents and topics", width = 1200, height = 600) %>%
ly_points(x = X1, y = X2,data = bow_test_train_model1,color = type,
hover=c(topic_highest_prob,title,url,keywords),size = 3
) %>%
set_palette(discrete_color = pal_color(c( "red", "skyblue")))